The Model

Nearest neighbors does not make any mathematical assumptions or require a lot of dependencies. It only requires a way to calculate distance and the assumption that proximate points are similar. One downside is that nearest neighbors does not take all of the information ino account--it only makes predictions based on a handful of points closest to the one of interest.


In [1]:
%%capture
%run "4 - Linear Algebra.ipynb"

import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
def majority_vote(labels):
    """assumes that labels are ordered from nearest to farthest"""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count for count in vote_counts.values() if count == winner_count])
    
    if num_winners == 1:
        return winner # unique winner, so return it
    else:
        return majority_vote(labels[:-1]) # try again without the farthest

def knn_classify(k, labeled_points, new_point):
    """each labeled point should be a pair (point, label)"""
    
    def key_func(point):
        p, _ = point
        return distance(p, new_point)
    
    # order the labeled points from nearest to farthest
    by_distance = sorted(labeled_points, key=key_func)
    
    # find the labels for the k closest
    k_nearest_labels = [label for _, label in by_distance[:k]]
    
    # and let them vote
    return majority_vote(k_nearest_labels)

In [3]:
from sklearn import datasets

iris = datasets.load_iris()

iris_0 = iris.data[iris.target == 0]
iris_1 = iris.data[iris.target == 1]
iris_2 = iris.data[iris.target == 2]

plt.scatter(iris_0[:, 0], iris_0[:, 1], label='0');
plt.scatter(iris_1[:, 0], iris_1[:, 1], label='1');
plt.scatter(iris_2[:, 0], iris_2[:, 1], label='2');
plt.legend();



In [4]:
# Classify a point that would lie at 6, 3
knn_classify(1, zip(iris.data[:, :2], iris.target), [6, 3])


Out[4]:
2

In [5]:
# try several different values for k
for k in [1, 3, 5, 7, 15]:
    num_correct = 0
    num_total = 0
        
    for x in zip(iris.data[:, :2], iris.target):
        num_total += 1
        point, target = x
        other_iris = [(p, t) for (p, t) in zip(iris.data[:, :2], iris.target) if p[0] != point[0] and p[1] != point[1]]
        
        predicted_iris = knn_classify(k, other_iris, point)
        if predicted_iris == target: num_correct += 1
    
    print(k, "neighbor[s]:", num_correct, "correct out of", num_total)


1 neighbor[s]: 114 correct out of 150
3 neighbor[s]: 118 correct out of 150
5 neighbor[s]: 119 correct out of 150
7 neighbor[s]: 121 correct out of 150
15 neighbor[s]: 122 correct out of 150

You can use k-nearest to predict on novel data. Here is how our model makes prediction for the three possible target values:


In [6]:
k = 1

for x in np.arange(4, 8, 0.15):
    for y in np.arange(1, 5, 0.15):
        predicted_iris = knn_classify(k, zip(iris.data[:, :2], iris.target), [x, y])
        color = ''
        if predicted_iris == 0:
            color = 'blue'
        elif predicted_iris == 1:
            color = 'orange'
        else:
            color = 'green'
        plt.scatter(x, y, color=color)


The Curse Of Dimensionality

k-nearest neighbors can perform poorly on high-dimension data sets. This is due to the fact that points are not close together when data is high-dimensional.


In [ ]: